Research workflow with confidential data:
The experience of BPLIM

BPLIM Team

2023-12-18

BPLIM Goal


Promote empirical studies of the Portuguese economy using microdata

  • Grant free access to fully documented ready-to-use or customized data sets

  • Provide access to substantial computational power and software

  • Provide technical, scientific, and computational support

  • Support the development of young researchers

  • Ensure the confidentiality and security of the microdata

  • Guarantee the reproducibility of the results

Data confidentiality

  • Microdata sets have different levels of confidentiality: low, medium, and high

  • If low:

    • access to anonymized data on the BPLIM server

  • If medium or high:

    • Modified data is made available on the BPLIM server
    • Dummy files are available on the BPLIM server or on the researcher's own machine

Data sets: Characteristics

  • Data are always anonymized: data sets are stripped of elements that allow direct or indirect identification of companies or individuals.

  • Data sets contain unique unit identifiers common across data sets within each project (e.g., tina and bina)

  • Data sets are based on a data extraction (“data freeze”) at a specific point in time

  • Labels are applied to all variables and value labels to all categorical variables (in PT and EN)

  • Detailed manuals and metadata are provided for all data sets

  • Data sets have registered DOIs and BibTeX files for correct citation

  • Data sets are stored in efficient ways that minimize file size
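The shared-identifier property above can be illustrated with a small sketch. Assuming a per-project secret salt (the hashing scheme and variable names here are illustrative, not BPLIM's actual method), the same unit always maps to the same pseudonym in every file of a project, so records can be linked across data sets without exposing the real identifier:

```python
import hashlib

def pseudonymize(real_id: str, project_salt: str) -> str:
    """Derive a stable, project-specific pseudonym from a real identifier."""
    digest = hashlib.sha256((project_salt + real_id).encode()).hexdigest()
    return digest[:10]

# Two hypothetical data sets sharing a firm identifier.
balance_sheet = [{"firm": "500100200"}, {"firm": "500300400"}]
credit_register = [{"firm": "500100200"}]

salt = "project-42"  # hypothetical per-project secret
for row in balance_sheet + credit_register:
    row["firm"] = pseudonymize(row["firm"], salt)

# The same real id yields the same pseudonym in both files,
# so the link between the two data sets is preserved.
assert balance_sheet[0]["firm"] == credit_register[0]["firm"]
```

Using a different salt per project also prevents pseudonyms from being linked across unrelated projects.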

BPLIM server

  • Remote access from anywhere (over a secure connection with NoMachine)

  • Inability to transfer, download, copy, paste or print data

  • Containers for operating system and statistical packages

  • Available software: R, Stata, Python, and Julia

  • Templates to help structure the code

  • Git for Version Control

BPLIM websites


What happens on BPLIM's side?

  • After the research project is approved, an account is opened on the server

  • Data for the project are prepared and placed on the server

  • The container is prepared according to the researcher's specifications

  • BPLIM meets with the researchers to guide them through the server

Dummy Data: What is different?

  • BPLIM creates the account for the project and places the original files in initial_dataset. External researchers do not have access to this account.

  • Next, BPLIM creates dummy files using the following in-house developed tools:
    • mdata for handling metadata (available on BPLIM's GitHub);
    • Dummyfi for creating dummy data sets based on the metadata of the original data sets (soon to be available).
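To give an idea of what metadata-driven dummy generation relies on, here is a rough sketch of extracting per-variable metadata from a data set. This is purely illustrative; it is not the output format of mdata or Dummyfi:

```python
def extract_metadata(rows):
    """Collect, per variable, its type plus observed categories or ranges."""
    meta = {}
    for row in rows:
        for var, value in row.items():
            entry = meta.setdefault(var, {"type": type(value).__name__,
                                          "values": set()})
            entry["values"].add(value)
    # Summarize: keep the category list for strings, min/max for numbers.
    for entry in meta.values():
        if entry["type"] == "str":
            entry["categories"] = sorted(entry.pop("values"))
        else:
            vals = entry.pop("values")
            entry["range"] = (min(vals), max(vals))
    return meta

# Hypothetical micro data set.
firms = [{"sector": "A", "assets": 10}, {"sector": "B", "assets": 30}]
meta = extract_metadata(firms)
# meta records, e.g., that "sector" is a string with categories A and B,
# and that "assets" is numeric with range (10, 30).
```

Dummy values can then be drawn from these categories and ranges without ever touching the confidential records themselves.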

Dummy data: the process

  • Extract metadata from original data sets;
  • Discuss the sampling strategy with the researcher (identification of key variables and sample size);
  • Create a random sample of the ids for each file. This is done in a way that preserves links across data sets;
  • Generate the code to produce the dummy files;
  • Deliver the package to generate the dummy data sets;
  • Meet with the researcher to explain how to use the package.
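The id-sampling step above can be sketched as follows: drawing the random sample of ids once, then filtering every file with that same set, keeps the links across data sets (file and variable names are illustrative):

```python
import random

def sample_ids(all_ids, k, seed=0):
    """Draw one random sample of unit ids, shared by all files."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return set(rng.sample(sorted(all_ids), k))

def filter_file(rows, keep, id_var="firm"):
    """Keep only rows whose id was sampled."""
    return [r for r in rows if r[id_var] in keep]

# Two hypothetical files sharing a firm identifier.
balance_sheet = [{"firm": f"F{i}", "assets": i * 10} for i in range(100)]
credit_register = [{"firm": f"F{i}", "loans": i} for i in range(0, 100, 2)]

keep = sample_ids({r["firm"] for r in balance_sheet}, k=20, seed=1)
bs_sample = filter_file(balance_sheet, keep)
cr_sample = filter_file(credit_register, keep)

# Every firm in the credit-register sample also appears in the
# balance-sheet sample, so links across data sets are preserved.
assert {r["firm"] for r in cr_sample} <= {r["firm"] for r in bs_sample}
```

Because the sample is drawn from the ids rather than from each file independently, a unit that appears in several files either survives in all of them or in none.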

Dummy Data Tool

[Screenshots shown in the talk:]

  • List of metadata files generated
  • Example of a metadata file
  • List of scripts to generate dummy data
  • Script to generate dummy data

Thank you!